import pandas as pd import numpy as npfrom lets_plot import*LetsPlot.setup_html(isolated_frame=True)import matplotlib.pyplot as pltfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import classification_reportfrom sklearn.tree import DecisionTreeClassifierfrom sklearn.ensemble import RandomForestClassifier# add the additional libraries you need to import for ML hereLetsPlot.setup_html(isolated_frame=True)
Show the code
# Include and execute your code here# import your data here using pandas and the URLdf4 = pd.read_csv('https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv')df4_maybe = pd.read_csv('https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv')#df4
Elevator pitch
A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
There is a good chance that we can predict aspestus issues with a home. Based on machine learning models we can show wether a home is built in certain ways indicating build procedure.
QUESTION|TASK 1
Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
I decided to find some charts that would show contrast from year to year or in the case of the scatterplot, variety of datapoints that would need further interpretation. For the bar graphs showing quality of homes, we can conclude that there is a stark difference between whether the home was built before or after 1980.
Show the code
# Include and execute your code here#for quality: count freq of avg, excellent,fair before/after 1980# for letsplot: +coord_catesian(ylim = [min,max])#xscale_y_continuous(limits)#geom_freqpoly()import seaborn as snsimport matplotlib.pyplot as pltplt.figure(figsize=(10,6))sns.boxplot(x='before1980', y='basement', data=df4)plt.title('Finished Basement Distribution by Age Group')plt.show()
Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
The first issue that was needed was to drop some columns that did not aid in the discussion of asbestos problems. Columns ‘parcel’ or ‘abstrprd’ did not appear to have any meaningful value. We test and train the model based on ‘before 1980’.
Show the code
# Include and execute your code hereX = df4.drop(columns=['parcel','abstrprd','yrbuilt','before1980']) y = df4['before1980'] X_train, X_test, Y_train, Y_test = train_test_split( X, y, test_size=0.2, random_state=42)print(X_train.shape, Y_train.shape)print(X_test.shape, Y_test.shape)
Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.
This data show the importance to the learning model for predicting the year that the homes were built. Displayed are the most important features of the home. Livearea, arechetecture style, bathrooms, stories are very important in this determination. This was able to be determined in the model.
Show the code
# Include and execute your code heretest_importances = test.feature_importances_importance_df = pd.DataFrame({'Feature': X_train.columns,'Importance': test_importances}).sort_values(by='Importance', ascending=False)importance_dftop_5_features = importance_df.head(5)print(top_5_features.head())plt.figure(figsize=(8, 5))plt.bar(top_5_features['Feature'], top_5_features['Importance'])plt.xticks(rotation=45, ha='right')plt.title('Top 5 Feature Importance')plt.tight_layout()plt.show()
Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
From the scores we see this model the precision, accuracy, and f1 are notable. Looing at the F1 score we see that the model has good precision and recall, meaning that this learning model can correctly identify homes built before 1980 to a high degree. Accuracy is the proporiton of correct predictions: this model can mostly predict correctly.
Show the code
# Include and execute your code hereprediction = test.predict(X_test)from sklearn.metrics import classification_reportprint('Classification report for RandomForest')print(classification_report(Y_test, prediction))
Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explian the differences between the models and which one you would recommend to the Client.
I decided to try some linear regression. I’m not sure how good this data set would be. I saw it as an option so I decided to see how it would go. Looks like the f1 scores, precision, and accuracy don’t quite reach 0.9. Maybe the random forest was a better method.
Show the code
# Include and execute your code herefrom sklearn.linear_model import LogisticRegressionx = df4.drop(columns=['parcel','abstrprd','yrbuilt','before1980']) y = df4['before1980'] x_train, x_test, y_train, y_test = train_test_split( x, y, test_size=0.2, random_state=42)print(x_train.shape, y_train.shape)print(x_test.shape, y_test.shape)logreg = LogisticRegression(max_iter=1500, random_state=42)logreg.fit(x_train, y_train) prediction = logreg.predict(x_test)from sklearn.metrics import classification_reportprint('Linear regression data')print(classification_report(y_test, prediction))
Join the dwellings_neighborhoods_ml.csv data to the dwelling_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recomend to the Client.
Linear regression still wasn’t showing the conclusions we needed but for the classifiers RandomForest and DecisionTree we can see a high f1 score, higher than the previous test before combining the data bases. This show how machine learning is powerful with more data.
Show the code
# Include and execute your code here# Join the datasets on the parcel columncombined_df = pd.merge(df4, df4_maybe, on='parcel', how='inner')print(f"Original dwellings_ml shape: {df4.shape}")print(f"Original neighborhoods_ml shape: {df4_maybe.shape}")print(f"Combined dataset shape: {combined_df.shape}")X_combined = combined_df.drop(columns=['parcel', 'abstrprd', 'yrbuilt', 'before1980'])y_combined = combined_df['before1980']X_train_c, X_test_c, y_train_c, y_test_c = train_test_split( X_combined, y_combined, test_size=0.2, random_state=42)print(f"Training data shape: {X_train_c.shape}")print(f"Testing data shape: {X_test_c.shape}")rf_model = RandomForestClassifier(n_estimators=400, criterion='gini', random_state=42)rf_model.fit(X_train_c, y_train_c)rf_pred = rf_model.predict(X_test_c)logreg_model = LogisticRegression(max_iter=1000, random_state=42)logreg_model.fit(X_train_c, y_train_c)logreg_pred = logreg_model.predict(X_test_c)dt_model = DecisionTreeClassifier(random_state=42)dt_model.fit(X_train_c, y_train_c)dt_pred = dt_model.predict(X_test_c)print("Random Forest with combined data:")print(classification_report(y_test_c, rf_pred))print("\nLogistic Regression with combined data:")print(classification_report(y_test_c, logreg_pred))print("\nDecision Tree with combined data:")print(classification_report(y_test_c, dt_pred))
Original dwellings_ml shape: (22913, 51)
Original neighborhoods_ml shape: (22913, 274)
Combined dataset shape: (27961, 324)
Training data shape: (22368, 320)
Testing data shape: (5593, 320)
Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.
I’m hoping this code makes sense.I found a linear regression model that could work for our purposes here. I’m hoping that setting the ‘yrbuilt’ column will test to parameterize the years, I’m hoping the model can pick up patterns. This code takes much longer to run. I’m skeptical if the model worked well because I have an Rsquared score of 1…which should not be possible. CORRECTION: I forgot to drop the yrbuilt column in the training. This model is only OKAY. It has a higher r2 score of 0.78 but we cannot give a solid confidence if being able to predict the year built can be based on this model.